AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
Data Dictionary:
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
# Installing the libraries with the specified versions
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
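After restarting the kernel, the installed versions can optionally be compared against the pinned ones above with a quick check:

```python
import numpy
import pandas
import sklearn

# Print the installed versions so they can be compared to the pinned ones above
for lib in (numpy, pandas, sklearn):
    print(lib.__name__, lib.__version__)
```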
# import libraries for data manipulation,data visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#import libraries for decision tree,metrics scores
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
)
#import library to ignore warnings
import warnings
warnings.filterwarnings("ignore")
# import drive from google colab and mount it
from google.colab import drive
drive.mount('/content/drive')
#load the csv data
data=pd.read_csv('/content/drive/MyDrive/Loan_Modelling.csv')
#display first 10 rows of dataset
data.head(10)
There are 14 columns in the dataframe. Each row represents one customer's personal details and their relationship with the bank.
#to get number of rows and columns
data.shape
There are 5000 rows and 14 columns in the Dataframe
#copy dataframe and drop ID column
df=data.copy()
df.drop('ID',axis=1,inplace=True)
The dataframe is copied and the ID column is dropped, since a customer identifier has no predictive significance.
#getting info on dataframe
df.info()
df.describe().T
#check for null values
df.isna().sum()
There are no null values
#check for duplicated values
df.duplicated().sum()
There are no duplicate values
#convert zipcode to strings and take first 2 digits
df["ZIPCode"] = df["ZIPCode"].astype(str)
df["ZIPCode"] = df["ZIPCode"].str[0:2]
#check unique zipcode values
df["ZIPCode"].unique()
Now, ZIPCode has only 7 unique two-digit prefixes and is of datatype object.
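The truncation can be illustrated on a toy series: the first two digits of a five-digit US ZIP code roughly identify a geographic region, so this step groups thousands of distinct codes into a handful of regions.

```python
import pandas as pd

# Toy five-digit ZIP codes; keeping only the first two digits groups them by region
zips = pd.Series([94720, 94112, 90245, 92121])
prefixes = zips.astype(str).str[:2]
print(prefixes.unique())  # → ['94' '90' '92']
```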
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar centre
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
# selecting numerical columns
num_col = ['Age','Experience','Family','Income','CCAvg','Mortgage']
# plot histogram and boxplot for numerical columns
for column in num_col:
    histogram_boxplot(df, column, kde=True)
    plt.title(f'Univariate Analysis of {column}')
    plt.show()
#selecting categorical columns
cat_col = ["Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode"]
#plot bar graphs for categorical columns (labeled_barplot calls plt.show itself)
for column in cat_col:
    labeled_barplot(df, column)
#plot correlation heatmap for numerical columns
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
df.corr(numeric_only=True)
#plot pairplot for numerical columns
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
#plot barplot for categorical columns with Personal Loan
sns.catplot(data=df, x="Education",y= "Personal_Loan",kind='bar')
plt.title('Education vs Personal Loan')
plt.show()
sns.catplot(data=df, x="ZIPCode",y= "Personal_Loan",kind='bar')
plt.title('Zipcode vs Personal Loan')
plt.show()
sns.catplot(data=df, x="Securities_Account",y= "Personal_Loan",kind='bar')
plt.title('Securities Account vs Personal Loan')
plt.show()
sns.catplot(data=df, x="CD_Account",y= "Personal_Loan",kind='bar')
plt.title('CD Account vs Personal Loan')
plt.show()
sns.catplot(data=df, x="Online",y= "Personal_Loan",kind='bar')
plt.title('Online vs Personal Loan')
plt.show()
sns.catplot(data=df, x="CreditCard",y= "Personal_Loan",kind='bar')
plt.title('Credit Card vs Personal Loan')
plt.show()
#plot boxplot of numerical values with Personal loan without outliers
sns.boxplot(data=df, x="Personal_Loan",y= "Age",showfliers=False)
plt.title('Age vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Experience",showfliers=False)
plt.title('Experience vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Income",showfliers=False)
plt.title('Income vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "CCAvg",showfliers=False)
plt.title('Credit card usage vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Mortgage",showfliers=False)
plt.title('Mortgage vs Personal Loan')
plt.show()
sns.boxplot(data=df, x="Personal_Loan",y= "Family",showfliers=False)
plt.title('Family vs Personal Loan')
plt.show()
#plot histogram of Age with and without personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Age');
sns.histplot(data=df[df['Personal_Loan']==1],x='Age');
#plot histogram of Experience with and without personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Experience');
sns.histplot(data=df[df['Personal_Loan']==1],x='Experience');
#plot histogram of Income with and without personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Income');
sns.histplot(data=df[df['Personal_Loan']==1],x='Income');
#plot histogram of Family with and without personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Family');
sns.histplot(data=df[df['Personal_Loan']==1],x='Family');
#plot histogram of credit card spending with and without personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='CCAvg');
sns.histplot(data=df[df['Personal_Loan']==1],x='CCAvg');
#plot histogram of Mortgage with and without personal loan
sns.histplot(data=df[df['Personal_Loan']==0],x='Mortgage');
sns.histplot(data=df[df['Personal_Loan']==1],x='Mortgage');
#checking unique values of Experience
df["Experience"].unique()
# replacing the negative Experience values with their absolute values
df["Experience"].replace(-1, 1, inplace=True)
df["Experience"].replace(-2, 2, inplace=True)
df["Experience"].replace(-3, 3, inplace=True)
#re-checking unique values of Experience
df["Experience"].unique()
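The three `replace` calls map each negative Experience value to its absolute value (presumably data-entry errors). An equivalent single step, shown here on toy data, is `Series.abs()`:

```python
import pandas as pd

# Toy Experience values containing the same negative entries seen in the data
exp = pd.Series([-1, -2, -3, 5, 10])
print(exp.abs().tolist())  # → [1, 2, 3, 5, 10]
```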
#create dummy variables with drop first to reduce columns
df_coded=pd.get_dummies(df, columns=['Education','ZIPCode'],drop_first=True)
df_coded.head(10)
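On a toy frame, `drop_first=True` keeps k−1 indicator columns per categorical feature, dropping the redundant first level:

```python
import pandas as pd

toy = pd.DataFrame({"Education": [1, 2, 3, 2]})
# drop_first=True drops the Education_1 indicator, keeping k-1 = 2 columns
coded = pd.get_dummies(toy, columns=["Education"], drop_first=True)
print(list(coded.columns))  # → ['Education_2', 'Education_3']
```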
#drop Experience column and split the encoded data
x=df_coded.drop(['Personal_Loan','Experience'],axis=1)
y=df['Personal_Loan']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
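Since only about 9-10% of customers accepted the loan, an optional `stratify=y` argument to `train_test_split` keeps the class proportions identical in train and test. A toy sketch (the split above does not use it; this is a suggestion, not the notebook's split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 10 positives out of 100 samples
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
# Both splits keep exactly 10% positives
print(y_tr.mean(), y_te.mean())
```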
#build a decision tree model using the Gini criterion
dtree=DecisionTreeClassifier(criterion='gini',random_state=1)
dtree.fit(x_train,y_train)
#check class distribution of the target and sizes of the splits
print('Target variable training distribution',y_train.value_counts())
print('Target variable testing distribution',y_test.value_counts())
print('Independent variable training distribution',x_train.shape[0])
print('Independent variable testing distribution',x_test.shape[0])
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance(model, independent, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    independent: independent variables
    target: dependent variable
    """
    pred = model.predict(independent)  # predicting using the independent variables
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_performance = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_performance
def make_confusion_matrix(model, independent, target):
    """
    Plot the confusion matrix with counts and percentages
    model: classifier used to predict
    independent: independent variables
    target: ground-truth labels
    """
    y_predict = model.predict(independent)
    cm = confusion_matrix(target, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt="")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
#create confusion matrix for training data
make_confusion_matrix(dtree, x_train, y_train)
#do model performance for training and test data; print them
print("Training data performance")
dtree_train_perf = model_performance(dtree, x_train, y_train)
print(dtree_train_perf)
print("Testing data performance")
dtree_test_perf = model_performance(dtree, x_test, y_test)
print(dtree_test_perf)
#list out independent variables
features = list(x.columns)
print(features)
#plot decision tree
plt.figure(figsize=(20,30))
tree.plot_tree(dtree,feature_names=features,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
#print decision tree
print(tree.export_text(dtree,feature_names=features,show_weights=True))
# print importance of features in tree building
print (pd.DataFrame(dtree.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
#plot importance of feature in tree building
importances = dtree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Pre-pruning
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(3,15),
'min_samples_leaf': [2,3,5,7,9,10,12,15],
'max_leaf_nodes' : [2, 3, 5, 10],
'min_impurity_decrease': [0.001,0.01,0.1]
}
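The grid above spans 12 values of max_depth, 8 of min_samples_leaf, 4 of max_leaf_nodes, and 3 of min_impurity_decrease, i.e. 12 × 8 × 4 × 3 = 1,152 combinations; with 5-fold cross-validation the search fits 5,760 trees. A quick sanity check:

```python
import numpy as np

parameters = {
    "max_depth": np.arange(3, 15),                    # 12 values
    "min_samples_leaf": [2, 3, 5, 7, 9, 10, 12, 15],  # 8 values
    "max_leaf_nodes": [2, 3, 5, 10],                  # 4 values
    "min_impurity_decrease": [0.001, 0.01, 0.1],      # 3 values
}
n_combos = int(np.prod([len(v) for v in parameters.values()]))
print(n_combos, "combinations,", n_combos * 5, "fits with 5-fold CV")  # 1152, 5760
```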
# Type of scoring used to compare parameter combinations: recall
recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(x_train, y_train)
#Training data confusion matrix
make_confusion_matrix(estimator,x_train,y_train)
print("Training data prepruned model performance")
dtree_preprun_train_perf = model_performance(estimator, x_train, y_train)
print(dtree_preprun_train_perf)
print("Testing data prepruned model performance")
dtree_preprun_test_perf = model_performance(estimator, x_test, y_test)
print(dtree_preprun_test_perf)
# prepruned tree
plt.figure(figsize=(20,30))
tree.plot_tree(estimator,feature_names=features,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
#print prepruned tree
print(tree.export_text(estimator,feature_names=features,show_weights=True))
# print importances of feature of prepruned tree
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
#plot importances of feature of prepruned tree
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Post-pruning
#define cost complexity pruning model
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
#get alpha and impurity for the dataset
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
# plot impurity vs alpha for training data
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
# fit one tree per candidate alpha and report the last (fully pruned) tree
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
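The pruning path has a useful invariant worth keeping in mind when reading the plots below: alphas come back sorted in ascending order, and the total leaf impurity is non-decreasing along the path. A minimal sketch on synthetic data (not the notebook's variables):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Alphas are sorted ascending; total leaf impurity never decreases along the path
assert np.all(np.diff(path.ccp_alphas) >= 0)
assert np.all(np.diff(path.impurities) >= 0)
print(len(path.ccp_alphas), "candidate alphas")
```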
#plot alpha vs depth of tree and alpha vs number of nodes
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
#get recall values for training and test data
recall_train = []
for clf in clfs:
    pred_train = clf.predict(x_train)
    recall_train.append(recall_score(y_train, pred_train))
recall_test = []
for clf in clfs:
    pred_test = clf.predict(x_test)
    recall_test.append(recall_score(y_test, pred_test))
#plot alpha vs recall
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
#get the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
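Note that `argmax` over `recall_test` chooses the alpha using the test set, which leaks test information into model selection. An alternative (a sketch on synthetic data, not the notebook's variables) is to cross-validate `ccp_alpha` on the training data only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training split
X, y = make_classification(n_samples=300, weights=[0.9], random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Search over the candidate alphas, scoring by cross-validated recall
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"ccp_alpha": path.ccp_alphas[:-1]},
    scoring="recall",
    cv=5,
).fit(X, y)
print(search.best_params_)
```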
# build model with the chosen alpha (and balanced class weights) and fit on the training data
estimator_postprun = DecisionTreeClassifier(ccp_alpha=0.0006414, class_weight='balanced', random_state=1)
estimator_postprun.fit(x_train, y_train)
#confusion matrix for postpruned training data
make_confusion_matrix(estimator_postprun,x_train,y_train)
#get performance for postpruned training data
print('Performance of postpruned training data')
dtree_postprun_train_perf = model_performance(estimator_postprun, x_train, y_train)
print(dtree_postprun_train_perf)
print('Performance of postpruned testing data')
dtree_postprun_test_perf = model_performance(estimator_postprun, x_test, y_test)
print(dtree_postprun_test_perf)
# postpruned tree
plt.figure(figsize=(20,30))
tree.plot_tree(estimator_postprun,feature_names=features,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# print postpruned tree
print(tree.export_text(estimator_postprun,feature_names=features,show_weights=True))
#print importance of feature in postpruned tree
print (pd.DataFrame(estimator_postprun.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
#plot importance of feature in postpruned tree
importances = estimator_postprun.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# compare the recall values from the different models
comparison_frame = pd.DataFrame({
    'Model': ['Initial decision tree model',
              'Decision tree with restricted maximum depth',
              'Decision tree with post-pruning'],
    'Train_Recall': [1.0, 0.927, 1.0],
    'Test_Recall': [0.88, 0.879, 0.913],
})
comparison_frame
Since test recall is highest for the post-pruned decision tree, the post-pruned model is selected.